

Search for: All records

Creators/Authors contains: "Jain, Rishabh"



  1. Deep Learning Recommendation Models (DLRMs) are widely used in personalized recommendation systems and are a major contributor to data-center AI cycles. Because of the high computational and memory-bandwidth demands of DLRMs, particularly of the embedding stage of DLRM inference, both CPUs and GPUs are used to host such workloads. The embedding stage performs heavy, irregular memory accesses that lead to significant stalls in the CPU pipeline. As model and parameter sizes keep growing in newer recommendation models, the computational dominance of the embedding stage also grows, thereby bringing into question the suitability of CPUs for inference. In this paper, we first quantify the cause of the irregular accesses and their impact on caches, and observe that off-chip memory access is the main contributor to high latency. We therefore exploit two well-known techniques: (1) software prefetching, to hide the memory-access latency suffered by demand loads, and (2) overlapping computation and memory accesses via hyperthreading, to reduce CPU stalls and minimize overall execution time. We evaluate our work on single-core and 24-core configurations with the latest recommendation models and recently released production traces. Our integrated techniques speed up inference by up to 1.59× and by 1.4× on average.
  2. Deep neural networks (DNNs) are increasingly popular owing to their ability to solve complex problems such as image recognition, autonomous driving, and natural language processing. Their growing complexity, coupled with the use of larger volumes of training data to achieve acceptable accuracy, has warranted the use of GPUs and other accelerators. Such accelerators are typically expensive, and users must pay a high upfront cost to acquire them. For infrequent use, users can instead leverage the public cloud to mitigate the high acquisition cost. However, given the wide diversity of hardware instances (particularly GPU instances) available in the public cloud, it becomes challenging for a user to make an appropriate choice from a cost/performance standpoint. In this work, we address this problem by (i) introducing Stash, a comprehensive distributed deep learning (DDL) profiler that determines the various execution stalls that DDL suffers from, and (ii) using Stash to extensively characterize various public cloud GPU instances by running popular DNN models on them. Specifically, Stash estimates two types of communication stalls, namely interconnect and network stalls, that play a dominant role in DDL execution time. Stash is implemented on top of prior work, DS-analyzer, which computes only CPU and disk stalls. Using our detailed stall characterization, we list the advantages and shortcomings of public cloud GPU instances to help users make an informed decision. Our characterization results indicate that the more expensive GPU instances may not be the most performant for all DNN models, and that AWS can sometimes sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time, and network-connected instances can suffer from up to a 5× slowdown compared to training on a single instance. Furthermore, (iii) we model the impact of DNN macroscopic features such as the number of layers and the number of gradients on communication stalls, and finally, (iv) we briefly discuss a cost comparison with existing work.
  3. The use of potassium (K) metal anodes could result in high-performance K-ion batteries that offer a sustainable and low-cost alternative to lithium (Li)-ion technology. However, the formation of dendrites on such K-metal surfaces is inevitable, which prevents their utilization. Here, we report that K dendrites can be healed in situ in a K-metal battery. The healing is triggered by current-controlled self-heating at the electrolyte/dendrite interface, which causes migration of surface atoms away from the dendrite tips, thereby smoothing the dendritic surface. We discover that this process is strikingly more efficient for K than for Li metal. We show that the reason for this is the far greater mobility of surface atoms in K relative to Li metal, which enables dendrite healing to take place at an order-of-magnitude lower current density. We demonstrate that the K-metal anode can be coupled with a potassium cobalt oxide cathode to achieve dendrite healing in a practical full-cell device.

     
  4. Abstract

    Graphite anodes offer low volumetric capacity in lithium‐ion batteries. By contrast, tellurene is expected to alloy with alkali metals with high volumetric capacity (≈2620 mAh cm⁻³), but to date there has been no detailed study of its alloying behavior. In this work, the alloying response of a range of alkali metals (A = Li, Na, or K) with few‐layer Te is investigated. In situ transmission electron microscopy and density functional theory both indicate that Te alloys with alkali metals to form A2Te. However, the crystalline order of the alloyed products varies significantly, from single‐crystal (for Li2Te) to polycrystalline (for Na2Te and K2Te). Typical alloying materials lose their crystallinity when reacted with Li, so the ability of Te to retain its crystallinity is surprising. Simulations reveal that, compared to Na or K, the migration of Li in Te is highly "isotropic", enabling the crystallinity to be preserved. Such isotropic Li transport is made possible by Te's peculiar structure comprising chiral chains bound by van der Waals forces. While alloying with Na and K shows poor performance, with Li, Te exhibits a stable volumetric capacity of ≈700 mAh cm⁻³, about twice the practical capacity of commercial graphite.

     
  5. Abstract

    Optical density (OD) is widely used to estimate the density of cells in liquid culture, but it cannot be compared between instruments without a standardized calibration protocol and is challenging to relate to actual cell count. We address this with an interlaboratory study comparing three simple, low-cost, and highly accessible OD calibration protocols across 244 laboratories, applied to eight strains of constitutive GFP-expressing E. coli. Based on our results, we recommend calibrating OD to estimated cell count using serial dilution of silica microspheres. This approach produces highly precise calibration (95.5% of residuals <1.2-fold), is easily assessed for quality control, also assesses the instrument's effective linear range, and can be combined with fluorescence calibration to obtain units of Molecules of Equivalent Fluorescein (MEFL) per cell, allowing direct comparison and data fusion with flow cytometry measurements: in our study, fluorescence-per-cell measurements showed only a 1.07-fold mean difference between plate reader and flow cytometry data.

     